A significant number of hotel bookings are called off due to cancellations or no-shows, typically because of changed plans or scheduling conflicts. Cancelling is often made easier by the option to do so free of charge, or at a low cost, which benefits guests but is a less desirable and potentially revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
Booking cancellations impact a hotel on several fronts.
The increasing number of cancellations calls for a machine-learning solution that can predict which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As the data scientist, you must analyze the data provided to find which factors most influence booking cancellations, build a model that predicts in advance which bookings will be canceled, and help formulate profitable cancellation and refund policies.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
# suppress warnings
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn import metrics
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
precision_recall_curve,
roc_curve,
make_scorer,
)
# Set standard styling for visualizations
# the writer has a slight color-vision deficiency; these settings assist
custom_palette = sns.color_palette('colorblind')
sns.set(rc={'grid.color': 'gray', 'grid.alpha': 0.5})
sns.palplot(custom_palette)
# 'paper' context selected for readability in the turned-in .html format,
# though this was originally written in a notebook context
sns.set(style='whitegrid', context='paper', palette=custom_palette)
# Set standard styling for charts and numeric displays
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
pd.set_option('display.width', 1000)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)
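For example, the float-format option above renders every float column with five decimal places, which is why the summary tables below show values like 65.00000 (a minimal check):

```python
import pandas as pd

# With the option set, floats render with 5 decimal places
pd.set_option("display.float_format", lambda x: "%.5f" % x)
print(pd.DataFrame({"x": [1 / 3]}))  # shows x as 0.33333
```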
These settings have essentially entered my personal library from previous projects, much as I assume a company would maintain branding guidelines for its data visualizations. There is nothing really new here; the new custom functions around model building are in that section.
def print_outliers_info(data, feature):
    """
    Calculates upper-outlier analysis information to pair with visualizations
    data: dataframe
    feature: dataframe column
    """
    Q1 = data[feature].quantile(0.25)
    Q3 = data[feature].quantile(0.75)
    IQR = Q3 - Q1
    upper_bound = Q3 + 1.5 * IQR
    max_feat = data[feature].max()
    outliers = data[data[feature] > upper_bound][feature].unique()
    outliers_sorted = np.sort(outliers)
    if len(outliers_sorted) > 6:
        outliers_sorted_abbr = np.append(outliers_sorted[:6], "...etc")
    else:
        outliers_sorted_abbr = outliers_sorted
    if len(outliers) > 0:
        outlier_df = pd.DataFrame({
            "IQR": [IQR],
            "Q3": [Q3],
            "Upper Bound": [upper_bound],
            "Max": [max_feat],
            "#rows > Upper Bound": [len(data[data[feature] > upper_bound])]
        })
        print(f"{feature} Outliers Information:\n")
        formatted_df = outlier_df.to_string(index=False, col_space=15, justify='left') + '\n'
        print(formatted_df)
        print(f"Unique Values Above Upper Bound: {outliers_sorted_abbr}")
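As a standalone illustration of the IQR fence these helpers rely on (toy values, not the hotel data):

```python
import pandas as pd

# Tukey's rule: anything above Q3 + 1.5 * IQR is flagged as an upper outlier
s = pd.Series([1, 2, 3, 4, 100])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)
print(upper, (s > upper).sum())  # → 7.0 1
```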
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15, 10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    # creating the 2 subplots
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75), "hspace": 0.05, "top": 0.95},
        figsize=figsize,
    )
    # create a title
    f2.suptitle(f"Histogram and Boxplot for {feature}", fontsize=16)
    # boxplot with a square marker indicating the mean value of the column
    create_boxplot(data, feature, ax_box2)
    # create histogram, honoring the bins argument
    create_histogram(data, feature, ax_hist2, kde, bins)
    # calculate and print outlier information for the feature
    print_outliers_info(data, feature)
def create_boxplot(data, feature, ax_box):
    sns.boxplot(
        data=data,
        x=feature,
        ax=ax_box,
        showmeans=True,
        meanprops={"marker": "s", "markersize": 8, "markerfacecolor": custom_palette[1], "markeredgecolor": "black"},
        medianprops={'linewidth': 4},
        color=custom_palette[2]
    )
    ax_box.set_xlabel("")
def create_histogram(data, feature, ax_hist, kde, bins):
    if bins is not None:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist, bins=bins, alpha=0.7)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist, alpha=0.7)
    add_mean_median_to_histogram(data, feature, ax_hist)
def add_mean_median_to_histogram(data, feature, ax_hist):
    # black underlay keeps the mean line visible on any background
    ax_hist.axvline(
        data[feature].mean(), color='black', linestyle='-', linewidth=8
    )
    ax_hist.axvline(
        data[feature].mean(), color=custom_palette[1], linestyle='-', linewidth=5, label="Mean"
    )
    ax_hist.axvline(
        data[feature].median(), color='black', linestyle='-', linewidth=5, label="Median"
    )
    ax_hist.legend(loc='upper right')
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None, rotation=90, sort_index=False):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    rotation: rotation angle for x-axis labels (default is 90)
    sort_index: whether to sort the index (default is False)
    """
    # check the data type of the column
    if data[feature].dtype == 'O':
        print(f"Skipping outlier analysis for {feature} as it contains string values.")
    elif np.issubdtype(data[feature].dtype, np.number):  # check if dtype is numeric
        # numeric columns also get outlier information
        print(f"Performing numeric-specific action for {feature}")
        print_outliers_info(data, feature)
    else:
        print(f"Unsupported dtype for {feature}.")
    # plot the barplot
    plot_barplot(data, feature, perc, n, rotation, sort_index)
def plot_barplot(data, feature, perc, n, rotation, sort_index):
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=rotation, fontsize=15)
    order = data[feature].value_counts().index[:n]
    if sort_index:
        order = sorted(order)
    ax = create_countplot(data, feature, order)
    for p in ax.patches:
        add_percentage_label(p, total, perc)
    # add title
    plt.title(f"Barplot for {feature}")
    plt.show()  # show the plot
def create_countplot(data, feature, order):
    return sns.countplot(
        data=data,
        x=feature,
        palette=custom_palette,
        order=order,
    )
def add_percentage_label(p, total, perc):
    if perc:
        label = "{:.1f}%".format(
            100 * p.get_height() / total
        )  # percentage of each class of the category
    else:
        label = p.get_height()  # count of each level of the category
    x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
    y = p.get_height()  # height of the bar
    plt.annotate(
        label,
        (x, y),
        ha="center",
        va="center",
        size=12,
        xytext=(0, 5),
        textcoords="offset points",
    )  # annotate above the bar
# This will be used as:
# dups_by_target(data, 'booking_status', 'Not_Canceled', 'Canceled')
# for this dataset
def dups_by_target(data, target, pos_value, neg_value):
    """
    Prints duplicate data for a given target column with positive and negative values.
    Parameters:
    - data: DataFrame
    - target: str, the column to analyze for duplicates
    - pos_value: str, the positive value for the target column
    - neg_value: str, the negative value for the target column
    """
    # Find duplicates for the positive value
    pos_rows = data[data[target] == pos_value]
    pos_dupes = pos_rows.duplicated().sum()
    pos_dupes_perc = (pos_dupes / len(pos_rows)) * 100
    pos_duplicates = pos_rows[pos_rows.duplicated(keep=False)]
    unique_sets_pos = pos_duplicates.groupby(list(pos_duplicates.columns)).size().reset_index(name='counts')
    # Find duplicates for the negative value
    neg_rows = data[data[target] == neg_value]
    neg_dupes = neg_rows.duplicated().sum()
    neg_dupes_perc = (neg_dupes / len(neg_rows)) * 100
    neg_duplicates = neg_rows[neg_rows.duplicated(keep=False)]
    unique_sets_neg = neg_duplicates.groupby(list(neg_duplicates.columns)).size().reset_index(name='counts')
    # Build summary frames for positive and negative values
    pos_df = pd.DataFrame({
        "Status": [pos_value],
        "Duplicate Count": [pos_dupes],
        "Percentage": [f"{pos_dupes_perc:.2f}%"],
        "Unique Sets Count": [len(unique_sets_pos)],
        "Max Count in Sets": [unique_sets_pos["counts"].max()]
    })
    neg_df = pd.DataFrame({
        "Status": [neg_value],
        "Duplicate Count": [neg_dupes],
        "Percentage": [f"{neg_dupes_perc:.2f}%"],
        "Unique Sets Count": [len(unique_sets_neg)],
        "Max Count in Sets": [unique_sets_neg["counts"].max()]
    })
    # Concatenate and print the summary
    summary_df = pd.concat([pos_df, neg_df], ignore_index=True)
    print(summary_df)
def stacked_barplot(data, predictor, target, rotation=90, sort_columns=True):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    rotation: rotation angle for x-axis labels (default is 90)
    sort_columns: whether to sort columns by values in the predictor/x-axis (default is True)
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    if sort_columns:
        tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
            by=sorter, ascending=False
        )
    else:
        tab = pd.crosstab(data[predictor], data[target], normalize="index")
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # place the legend outside the plot area
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.xticks(rotation=rotation, fontsize=15)
    plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title(f"Distribution of {predictor} for {target}={target_uniq[0]}")
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color=custom_palette[2],
        alpha=0.7,
        stat="density",
        line_kws={"color": "black"}
    )
    axs[0, 1].set_title(f"Distribution of {predictor} for {target}={target_uniq[1]}")
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color=custom_palette[3],
        alpha=0.7,
        stat="density",
        line_kws={"color": "black"}
    )
    axs[1, 0].set_title("Boxplot w.r.t. target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 0]
    )
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t. target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
    )
    plt.tight_layout()
    plt.show()
# read the data
from google.colab import files
import io
try:
    uploaded
except NameError:
    uploaded = files.upload()
hotel = pd.read_csv(io.BytesIO(uploaded['INNHotelsGroup.csv']))
Saving INNHotelsGroup.csv to INNHotelsGroup.csv
# copying data to another variable to avoid any changes to original data
data = hotel.copy()
data.head() ## view top 5 rows of the data
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00000 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68000 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00000 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00000 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled |
data.tail() ## view last 5 rows of the data
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36270 | INN36271 | 3 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 4 | 85 | 2018 | 8 | 3 | Online | 0 | 0 | 0 | 167.80000 | 1 | Not_Canceled |
| 36271 | INN36272 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 228 | 2018 | 10 | 17 | Online | 0 | 0 | 0 | 90.95000 | 2 | Canceled |
| 36272 | INN36273 | 2 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148 | 2018 | 7 | 1 | Online | 0 | 0 | 0 | 98.39000 | 2 | Not_Canceled |
| 36273 | INN36274 | 2 | 0 | 0 | 3 | Not Selected | 0 | Room_Type 1 | 63 | 2018 | 4 | 21 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled |
| 36274 | INN36275 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 207 | 2018 | 12 | 30 | Offline | 0 | 0 | 0 | 161.67000 | 0 | Not_Canceled |
data.shape ## view dimensions of the data
(36275, 19)
data.info() ## view datatypes for each column, preliminary missing value check
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
# checking for duplicate values
data.duplicated().sum()
0
Let's drop the Booking_ID column before we proceed.
data = data.drop(['Booking_ID'], axis = 1) ## Drop the Booking_ID column from the dataframe
dups_by_target(data, 'booking_status', 'Not_Canceled', 'Canceled')
         Status  Duplicate Count Percentage  Unique Sets Count  Max Count in Sets
0  Not_Canceled             5832     23.91%               2093                 91
1      Canceled             4443     37.38%               1045                 83
This is not super surprising, as hotels will see the same room types booked under very common arrangements. It is good to know going in that roughly 37% of the Canceled rows are duplicates, with at least one duplicate set quite large at 83 rows. We can check this again after the outlier work to see whether the number of duplicates increases substantially.
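That recheck matters because capping an outlier column can itself create new duplicate rows. A sketch on a toy frame (illustration only; column names here are made up):

```python
import pandas as pd

# Toy frame: two bookings differ only in an extreme lead_time value
toy = pd.DataFrame({"lead_time": [400, 443, 300], "price": [100, 100, 90]})
before = toy.duplicated().sum()

# Capping the outliers makes the first two rows identical
toy.loc[toy["lead_time"] >= 365, "lead_time"] = 365
after = toy.duplicated().sum()
print(before, after)  # → 0 1
```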
data.head()
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00000 | 0 | Not_Canceled |
| 1 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68000 | 1 | Not_Canceled |
| 2 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00000 | 0 | Canceled |
| 3 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00000 | 0 | Canceled |
| 4 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled |
The EDA sections below were kept in this location, instead of in an appendix, due to the amount of outlier treatment conducted as part of the EDA. This section is meant to be a complete summary, so the remaining EDA subsections need to be opened only as needed.
Booking_ID:
Booking Status:
Number of Adults:
Number of Children:
Number of Weekend Nights:
Number of Week Nights:
Type of Meal Plan:
Required Car Parking Space:
Room Type Reserved:
Lead Time:
Arrival Year and Month:
Arrival Date:
Market Segment Type:
Repeated Guest:
Previous Cancellations and Bookings Not Canceled:
Average Price per Room:
Let's check the statistical summary of the data.
data.describe().T ## print the statistical summary of the data
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.00000 | 1.84496 | 0.51871 | 0.00000 | 2.00000 | 2.00000 | 2.00000 | 4.00000 |
| no_of_children | 36275.00000 | 0.10528 | 0.40265 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 10.00000 |
| no_of_weekend_nights | 36275.00000 | 0.81072 | 0.87064 | 0.00000 | 0.00000 | 1.00000 | 2.00000 | 7.00000 |
| no_of_week_nights | 36275.00000 | 2.20430 | 1.41090 | 0.00000 | 1.00000 | 2.00000 | 3.00000 | 17.00000 |
| required_car_parking_space | 36275.00000 | 0.03099 | 0.17328 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| lead_time | 36275.00000 | 85.23256 | 85.93082 | 0.00000 | 17.00000 | 57.00000 | 126.00000 | 443.00000 |
| arrival_year | 36275.00000 | 2017.82043 | 0.38384 | 2017.00000 | 2018.00000 | 2018.00000 | 2018.00000 | 2018.00000 |
| arrival_month | 36275.00000 | 7.42365 | 3.06989 | 1.00000 | 5.00000 | 8.00000 | 10.00000 | 12.00000 |
| arrival_date | 36275.00000 | 15.59700 | 8.74045 | 1.00000 | 8.00000 | 16.00000 | 23.00000 | 31.00000 |
| repeated_guest | 36275.00000 | 0.02564 | 0.15805 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| no_of_previous_cancellations | 36275.00000 | 0.02335 | 0.36833 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 13.00000 |
| no_of_previous_bookings_not_canceled | 36275.00000 | 0.15341 | 1.75417 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 58.00000 |
| avg_price_per_room | 36275.00000 | 103.42354 | 35.08942 | 0.00000 | 80.30000 | 99.45000 | 120.00000 | 540.00000 |
| no_of_special_requests | 36275.00000 | 0.61966 | 0.78624 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 5.00000 |
Let's start with the target variable. We will encode Canceled bookings as 1 and Not_Canceled as 0 when we prep for modeling, but let's keep it categorical for the analysis.
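The planned encoding can be sketched on a toy Series (illustration only; the actual mapping happens during modeling prep):

```python
import pandas as pd

# Map the target labels to 1/0 as planned for the modeling stage
status = pd.Series(["Canceled", "Not_Canceled", "Canceled"])
encoded = status.map({"Canceled": 1, "Not_Canceled": 0})
print(encoded.tolist())  # → [1, 0, 1]
```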
labeled_barplot(data, "booking_status", perc= True, rotation = 0)
Skipping outlier analysis for booking_status as it contains string values.
As we treat outliers below, it would be ideal to preserve this approximate 2/3 Not_Canceled to 1/3 Canceled split.
histogram_boxplot(data, 'lead_time')
lead_time Outliers Information:

IQR        Q3         Upper Bound  Max  #rows > Upper Bound
109.00000  126.00000  289.50000    443  1331

Unique Values Above Upper Bound: ['290' '291' '292' '293' '294' '295' '...etc']
I am unconcerned about the 0 values; those are people showing up because the vacancy sign is on, or an otherwise last-minute change in travel plans. The upper outliers represent 3.6% of the data, so let's pull some of those in. The density of outliers drops off around 350, so let's round up to a full year of 365 days. This creates a data bucket of "greater than one year lead time".
data.loc[data["lead_time"] >= 365, "lead_time"] = 365
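The same cap can equivalently be written with `Series.clip`, shown here on toy values rather than the hotel frame:

```python
import pandas as pd

# Cap lead times at one year; values at or below 365 pass through unchanged
lead = pd.Series([5, 120, 400, 443])
print(lead.clip(upper=365).tolist())  # → [5, 120, 365, 365]
```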
histogram_boxplot(data, 'avg_price_per_room', bins = 25)
avg_price_per_room Outliers Information:

IQR       Q3         Upper Bound  Max        #rows > Upper Bound
39.70000  120.00000  179.55000    540.00000  1069

Unique Values Above Upper Bound: ['179.71' '179.92' '180.0' '180.16' '180.2' '180.25' '...etc']
The outliers on both ends are worth investigating. The 0-Euro values for some rooms might be explained by room type or market segment, while we may need to decide on a specific upper-bound value for outliers. The one room over 500 can go, but what about everything else?
print_outliers_info(data, "avg_price_per_room")
avg_price_per_room Outliers Information:

IQR       Q3         Upper Bound  Max        #rows > Upper Bound
39.70000  120.00000  179.55000    540.00000  1069

Unique Values Above Upper Bound: ['179.71' '179.92' '180.0' '180.16' '180.2' '180.25' '...etc']
Let's look at the distribution of data above the upper bound of 179; this will help us determine a "cut-off point" that has the correct impact on the analysis.
price_upper_bound = 179.55
price_over_ubound = data[data['avg_price_per_room'] >= price_upper_bound]
histogram_boxplot(price_over_ubound, 'avg_price_per_room')
avg_price_per_room Outliers Information:

IQR       Q3         Upper Bound  Max        #rows > Upper Bound
29.86000  216.90000  261.69000    540.00000  42

Unique Values Above Upper Bound: ['262.7' '263.55' '263.91' '264.1' '265.0' '265.44' '...etc']
I see that the low-code notebook suggests cutting out only the one data point greater than 500 Euros. However, since the 1069 outlier room prices are 3% of the data set, and since I lack a subject-matter expert, I am choosing to replace more outliers than that. Looking at the distribution of the outliers, I will instead replace all values above 261.69 (the upper bound of the outlier distribution).
# assigning the outliers with the value of 261.69
price_upper_bound2 = 261.69
data.loc[data["avg_price_per_room"] >= price_upper_bound2, "avg_price_per_room"] = price_upper_bound2
Now let's take a look at the 0 values to assess validity/sanity.
data[data["avg_price_per_room"] == 0]
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 2 | 2017 | 9 | 10 | Complementary | 0 | 0 | 0 | 0.00000 | 1 | Not_Canceled |
| 145 | 1 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 13 | 2018 | 6 | 1 | Complementary | 1 | 3 | 5 | 0.00000 | 1 | Not_Canceled |
| 209 | 1 | 0 | 0 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 4 | 2018 | 2 | 27 | Complementary | 0 | 0 | 0 | 0.00000 | 1 | Not_Canceled |
| 266 | 1 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2017 | 8 | 12 | Complementary | 1 | 0 | 1 | 0.00000 | 1 | Not_Canceled |
| 267 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 4 | 2017 | 8 | 23 | Complementary | 0 | 0 | 0 | 0.00000 | 1 | Not_Canceled |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35983 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 7 | 0 | 2018 | 6 | 7 | Complementary | 1 | 4 | 17 | 0.00000 | 1 | Not_Canceled |
| 36080 | 1 | 0 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 7 | 0 | 2018 | 3 | 21 | Complementary | 1 | 3 | 15 | 0.00000 | 1 | Not_Canceled |
| 36114 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 3 | 2 | Online | 0 | 0 | 0 | 0.00000 | 0 | Not_Canceled |
| 36217 | 2 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 2 | 3 | 2017 | 8 | 9 | Online | 0 | 0 | 0 | 0.00000 | 2 | Not_Canceled |
| 36250 | 1 | 0 | 0 | 2 | Meal Plan 2 | 0 | Room_Type 1 | 6 | 2017 | 12 | 10 | Online | 0 | 0 | 0 | 0.00000 | 0 | Not_Canceled |
545 rows × 18 columns
data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()
Complementary    354
Online           191
Name: market_segment_type, dtype: int64
The zero-value average room prices are all comps, or possibly online comps or card rewards. These are valid data points, so we will leave them as is!
Let's look at our final distribution:
histogram_boxplot(data, 'avg_price_per_room', bins = 25)
avg_price_per_room Outliers Information:

IQR       Q3         Upper Bound  Max        #rows > Upper Bound
39.70000  120.00000  179.55000    261.69000  1069

Unique Values Above Upper Bound: ['179.71' '179.92' '180.0' '180.16' '180.2' '180.25' '...etc']
The number of outliers did not change, simply the magnitude of their outlier-ness. I think this will affect the analysis correctly: pulling the distribution a little closer to normal while honoring that there are "high dollar rooms". I anticipate that, at least in the giant tree, there will be a price node; I can watch what that price is to see whether my outlier treatment was too aggressive.
histogram_boxplot(data, 'no_of_previous_cancellations')
no_of_previous_cancellations Outliers Information:

IQR      Q3       Upper Bound  Max  #rows > Upper Bound
0.00000  0.00000  0.00000      13   338

Unique Values Above Upper Bound: ['1' '2' '3' '4' '5' '6' '...etc']
No removal of outliers here; all these values are valid. There is simply a large number of 0 previous cancellations. Let's look at some patterns in the > 0 cancellation data.
data.loc[data['no_of_previous_cancellations'] > 0, "market_segment_type"].value_counts()
Corporate        176
Offline           68
Online            55
Complementary     36
Aviation           3
Name: market_segment_type, dtype: int64
Most previous cancellations come from the corporate segment.
data.loc[data['booking_status'] == "Canceled", "market_segment_type"].value_counts()
Online       8475
Offline      3153
Corporate     220
Aviation       37
Name: market_segment_type, dtype: int64
Whereas the cancellations recorded in booking_status are mostly from online bookings.
histogram_boxplot(data, 'no_of_previous_bookings_not_canceled')
no_of_previous_bookings_not_canceled Outliers Information:

IQR      Q3       Upper Bound  Max  #rows > Upper Bound
0.00000  0.00000  0.00000      58   812

Unique Values Above Upper Bound: ['1' '2' '3' '4' '5' '6' '...etc']
Similarly, we will leave these outliers. A guest with 58 prior non-canceled bookings is a frequent traveler or a company, and they are an important feature for this model.
Let's look at the booking-status split for new customers (those with both previous cancellations and previous non-canceled bookings equal to 0).
data.loc[(data['no_of_previous_cancellations'] == 0) & (data['no_of_previous_bookings_not_canceled'] == 0),"booking_status"].value_counts(normalize = True)
Not_Canceled   0.66420
Canceled       0.33580
Name: booking_status, dtype: float64
This is essentially a miniature decision tree, one that may or may not end up being relevant to our final model. This is slight feature engineering in a new dataframe; we will decide later whether to carry it over to the primary dataframe.
data2 = data.copy()
# Add new columns that show previous status as yes/no instead of counts
data2['previously_cancelled'] = np.where(data2['no_of_previous_cancellations'] > 0, 'yes', 'no')
data2['previous_stayed'] = np.where(data2['no_of_previous_bookings_not_canceled'] > 0, 'yes', 'no')
# Group by previously_cancelled and previous_stayed, then count by booking_status
result = data2.groupby(['previously_cancelled', 'previous_stayed', 'booking_status']).size().reset_index(name='count')
print(result)
  previously_cancelled previous_stayed booking_status  count
0                   no              no       Canceled  11869
1                   no              no   Not_Canceled  23476
2                   no             yes   Not_Canceled    592
3                  yes              no       Canceled      9
4                  yes              no   Not_Canceled    109
5                  yes             yes       Canceled      7
6                  yes             yes   Not_Canceled    213
# Group by previously_cancelled and previous_stayed, then count by booking_status
result = data2.groupby(['booking_status','previously_cancelled', 'previous_stayed']).size().reset_index(name='count')
print(result)
  booking_status previously_cancelled previous_stayed  count
0       Canceled                   no              no  11869
1       Canceled                  yes              no      9
2       Canceled                  yes             yes      7
3   Not_Canceled                   no              no  23476
4   Not_Canceled                   no             yes    592
5   Not_Canceled                  yes              no    109
6   Not_Canceled                  yes             yes    213
This shows us that about 1/3 of new-customer bookings (shown as no/no for the previous statuses) were cancelled.
Further, all 592 customers with no previous cancellations and at least one previous stay (no/yes) did not cancel. That sounds like a potentially pure node, but that slice of the data is small.
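That subgroup check can be sketched on a stand-in frame; the column names `prev_cancel`, `prev_stayed`, and `status` here are illustrative, not the real ones:

```python
import pandas as pd

# Toy data: measure the cancellation rate inside a candidate "pure" segment
toy = pd.DataFrame({
    "prev_cancel": [0, 0, 0, 1],
    "prev_stayed": [1, 1, 0, 0],
    "status": ["Not_Canceled", "Not_Canceled", "Canceled", "Canceled"],
})
seg = toy[(toy["prev_cancel"] == 0) & (toy["prev_stayed"] == 1)]
rate = (seg["status"] == "Canceled").mean()
print(len(seg), rate)  # → 2 0.0
```

A 0.0 rate on a small slice is suggestive but not conclusive, which is the caveat made above.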
Belatedly realizing this column exists, but I will keep the previous analysis in the cancelled-vs-not-cancelled section. In this data, Yes = 1 and No = 0. This column counts as a repeated guest anyone who has booked before, even if they cancelled, so the information above might actually be stronger.
labeled_barplot(data, "repeated_guest", perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for repeated_guest
repeated_guest Outliers Information:

IQR      Q3       Upper Bound  Max  #rows > Upper Bound
0.00000  0.00000  0.00000      1    930

Unique Values Above Upper Bound: [1]
# Group by previously_cancelled and previous_stayed, then count by booking_status
result = data2.groupby(['booking_status','previously_cancelled', 'previous_stayed', 'repeated_guest']).size().reset_index(name='count')
print(result)
  booking_status previously_cancelled previous_stayed  repeated_guest  count
0       Canceled                   no              no               0  11869
1       Canceled                  yes              no               1      9
2       Canceled                  yes             yes               1      7
3   Not_Canceled                   no              no               0  23476
4   Not_Canceled                   no             yes               1    592
5   Not_Canceled                  yes              no               1    109
6   Not_Canceled                  yes             yes               1    213
labeled_barplot(data, "no_of_adults", perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for no_of_adults
no_of_adults Outliers Information:

IQR      Q3       Upper Bound  Max  #rows > Upper Bound
0.00000  2.00000  2.00000      4    2333

Unique Values Above Upper Bound: [3 4]
The "outliers" in number of adults are just the 3- and 4-adult bookings, which are real customer rows, so we will leave them for now.
labeled_barplot(data, "no_of_children", perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for no_of_children
no_of_children Outliers Information:

IQR      Q3       Upper Bound  Max  #rows > Upper Bound
0.00000  0.00000  0.00000      10   2698

Unique Values Above Upper Bound: [ 1  2  3  9 10]
data.loc[(data['no_of_children'] > 2),"booking_status"].value_counts()
Not_Canceled    16
Canceled         6
Name: booking_status, dtype: int64
We can see that the roughly 2/3 Not_Canceled rate holds for these 22 rows, so replace the 9 and 10 children values with 3. This creates a "greater than 2" category.
# treat values of 3, 9, and 10 as "greater than 2 children"
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)
labeled_barplot(data, 'no_of_week_nights', perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for no_of_week_nights

no_of_week_nights Outliers Information:

| IQR | Q3 | Upper Bound | Max | #rows > Upper Bound |
|---|---|---|---|---|
| 2.00000 | 3.00000 | 6.00000 | 17 | 324 |

Unique Values Above Upper Bound: [7 8 9 10 11 12 ...]
The 0 weeknight category is the customers who only stay on weekends (Saturday or Sunday).
The 6+ category is customers whose stays span two calendar weeks, across a weekend. Let's combine all the 6+ values into 6.
# treat all customers who stayed over two calendar weeks as the same
data["no_of_week_nights"] = data["no_of_week_nights"].apply(lambda x: min(x, 6))
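An equivalent, vectorized way to apply the same cap (a stylistic alternative, shown here on a small stand-in series rather than the real column) is pandas' `clip`:

```python
import pandas as pd

# stand-in values for no_of_week_nights (hypothetical sample)
s = pd.Series([0, 2, 6, 7, 11, 17])

# cap everything above 6 at 6; same effect as applying min(x, 6) row by row
capped = s.clip(upper=6)
print(capped.tolist())  # → [0, 2, 6, 6, 6, 6]
```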
Let's check the assumption that these 6+ weekday stays include a weekend:
data[data["no_of_week_nights"] == 6].groupby("no_of_weekend_nights").size().reset_index(name='count')
| | no_of_weekend_nights | count |
|---|---|---|
| 0 | 2 | 246 |
| 1 | 3 | 100 |
| 2 | 4 | 112 |
| 3 | 5 | 34 |
| 4 | 6 | 20 |
| 5 | 7 | 1 |
This looks good to me; all of these customers have at least 2 weekend nights in their stays. We can confirm that the 6+ weekday-stay customers are all multi-calendar-week customers.
labeled_barplot(data, 'no_of_weekend_nights', perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for no_of_weekend_nights

no_of_weekend_nights Outliers Information:

| IQR | Q3 | Upper Bound | Max | #rows > Upper Bound |
|---|---|---|---|---|
| 2.00000 | 2.00000 | 5.00000 | 7 | 21 |

Unique Values Above Upper Bound: [6 7]
# treat all customers who stayed over three calendar weekends the same
data["no_of_weekend_nights"] = data["no_of_weekend_nights"].apply(lambda x: min(x, 5))
labeled_barplot(data, 'required_car_parking_space', perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for required_car_parking_space

required_car_parking_space Outliers Information:

| IQR | Q3 | Upper Bound | Max | #rows > Upper Bound |
|---|---|---|---|---|
| 0.00000 | 0.00000 | 0.00000 | 1 | 1124 |

Unique Values Above Upper Bound: [1]
labeled_barplot(data, 'type_of_meal_plan', perc=True, rotation = 45)
Skipping outlier analysis for type_of_meal_plan as it contains string values.
data['type_of_meal_plan'].value_counts()
Meal Plan 1 27835 Not Selected 5130 Meal Plan 2 3305 Meal Plan 3 5 Name: type_of_meal_plan, dtype: int64
Meal Plan 3 is closest to Meal Plan 2, but I'll leave it as a separate categorical for now. With the count so small, I can't imagine it will appear in the final tree unless it's "Meal Plan 1" vs "Not Meal Plan 1".
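If Meal Plan 3 ever did need handling, one option (not applied here) is folding rare categories into the closest larger one. A minimal sketch on a stand-in series, with an assumed cutoff of 2 occurrences:

```python
import pandas as pd

# stand-in meal-plan column (hypothetical counts, not the real 36k rows)
s = pd.Series(["Meal Plan 1"] * 6 + ["Meal Plan 2"] * 3 + ["Meal Plan 3"])

# fold any category with fewer than 2 occurrences into "Meal Plan 2",
# the plan the text above calls closest to Meal Plan 3
counts = s.value_counts()
rare = counts[counts < 2].index
s = s.replace(dict.fromkeys(rare, "Meal Plan 2"))
print(s.value_counts().to_dict())  # → {'Meal Plan 1': 6, 'Meal Plan 2': 4}
```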
labeled_barplot(data, 'room_type_reserved', perc=True, rotation = 45)
Skipping outlier analysis for room_type_reserved as it contains string values.
labeled_barplot(data, 'arrival_month', perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for arrival_month
# grouping the data on arrival months and extracting the count of bookings
monthly_data = data.groupby(["arrival_month"])["booking_status"].count()
# creating a dataframe with months and count of customers in each month
monthly_data = pd.DataFrame(
{"Month": list(monthly_data.index), "Guests": list(monthly_data.values)}
)
# plotting the trend over different months
plt.figure(figsize=(10, 5))
sns.lineplot(data=monthly_data, x="Month", y="Guests")
plt.show()
These effectively communicate the same thing, but we rarely get to use lineplots in this class so it was nice to include.
# Let's start this feature engineering a little early so we can see cancellations by month as well
data["booking_status"] = data["booking_status"].apply(
lambda x: 1 if x == "Canceled" else 0
)
stacked_barplot(data, "arrival_month", "booking_status", rotation = 0, sort_columns = False) ## Complete the code to plot stacked barplot for arrival month and booking status
| arrival_month | 0 | 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| 10 | 3437 | 1880 | 5317 |
| 9 | 3073 | 1538 | 4611 |
| 8 | 2325 | 1488 | 3813 |
| 7 | 1606 | 1314 | 2920 |
| 6 | 1912 | 1291 | 3203 |
| 4 | 1741 | 995 | 2736 |
| 5 | 1650 | 948 | 2598 |
| 11 | 2105 | 875 | 2980 |
| 3 | 1658 | 700 | 2358 |
| 2 | 1274 | 430 | 1704 |
| 12 | 2619 | 402 | 3021 |
| 1 | 990 | 24 | 1014 |
There are more cancellations in the Northern Hemisphere summer months, when companies are often lenient with leave and children are largely not in school.
labeled_barplot(data, "market_segment_type", rotation = 45, perc=True)
Skipping outlier analysis for market_segment_type as it contains string values.
labeled_barplot(data, 'no_of_special_requests', perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for no_of_special_requests

no_of_special_requests Outliers Information:

| IQR | Q3 | Upper Bound | Max | #rows > Upper Bound |
|---|---|---|---|---|
| 1.00000 | 1.00000 | 2.50000 | 5 | 761 |

Unique Values Above Upper Bound: [3 4 5]
Outlier treatment: combine the 3, 4, and 5 special-request bookings into a single 3+ bucket, since the upper bound is 2.5 and a relatively small share of data points lies beyond it.
data.loc[data["no_of_special_requests"] > 3, "no_of_special_requests"] = 3
cols_list = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(12, 7))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="cividis"
)
plt.show()
There are not any surprises in this correlation matrix. There is additional work below to look at the smaller correlation values, but here are the big ones:
Positive Correlations
Negative Correlations
Hotel rates are dynamic and change according to demand and customer demographics. Let's see how prices vary across different market segments
plt.figure(figsize=(10, 6))
sns.boxplot(
data=data, x="market_segment_type", y="avg_price_per_room"
)
plt.show()
Let's see how booking status varies across different market segments. Also, how average price per room impacts booking status
# Reminder that "1" here is a Cancelation since that is the behavior we are modeling
stacked_barplot(data, "market_segment_type", "booking_status", rotation = 45)
| market_segment_type | 0 | 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| Online | 14739 | 8475 | 23214 |
| Offline | 7375 | 3153 | 10528 |
| Corporate | 1797 | 220 | 2017 |
| Aviation | 88 | 37 | 125 |
| Complementary | 391 | 0 | 391 |
Many guests have special requirements when booking a hotel room. Let's see how it impacts cancellations
stacked_barplot(data, "no_of_special_requests", "booking_status", rotation = 0)
| no_of_special_requests | 0 | 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| 0 | 11232 | 8545 | 19777 |
| 1 | 8670 | 2703 | 11373 |
| 2 | 3727 | 637 | 4364 |
| 3 | 761 | 0 | 761 |
The correlation matrix predicted this: a negative correlation between special requests and cancellations (where cancellation = 1). As the number of requests increases, so does the likelihood of not cancelling. This seems to indicate a level of investment in the trip.
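Using the counts in the table above, the cancellation rate at each request level can be computed directly (counts hard-coded from that output):

```python
import pandas as pd

# counts from the stacked barplot table above; index = no_of_special_requests
counts = pd.DataFrame(
    {"kept": [11232, 8670, 3727, 761], "cancelled": [8545, 2703, 637, 0]},
    index=[0, 1, 2, 3],
)
counts["cancel_rate"] = counts["cancelled"] / (counts["kept"] + counts["cancelled"])
print(counts["cancel_rate"].round(3).tolist())  # → [0.432, 0.238, 0.146, 0.0]
```

The rate falls monotonically from 43% at zero requests to 0% at three or more.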
Let's see if the special requests made by the customers impacts the prices of a room
plt.figure(figsize=(10, 5))
sns.boxplot(data = data, x = 'no_of_special_requests', y = 'avg_price_per_room') ## Complete the code to create boxplot for no of special requests and average price per room (excluding the outliers)
plt.show()
We saw earlier that there is a positive correlation between booking status and average price per room. Let's analyze it
distribution_plot_wrt_target(data, "avg_price_per_room", "booking_status")
We see that the average price per room is slightly greater in the case of cancellations.
There is a positive correlation between booking status and lead time also. Let's analyze it further
distribution_plot_wrt_target(data, 'lead_time', 'booking_status') ## Complete the code to find distribution of lead time wrt booking status
Greater lead times generally correlate with more cancellations.
Generally people travel with their spouse and children for vacations or other activities. Let's create a new dataframe of the customers who traveled with their families and analyze the impact on booking status.
family_data = data[(data["no_of_children"] >= 0) & (data["no_of_adults"] > 1)]
family_data.shape
(28441, 18)
pd.options.mode.chained_assignment = None # default='warn'
family_data.loc[:, "no_of_family_members"] = family_data["no_of_adults"] + family_data["no_of_children"]
stacked_barplot(family_data, 'no_of_family_members', 'booking_status', rotation = 0, sort_columns = False) ## Complete the code to plot stacked barplot for no of family members and booking status
| no_of_family_members | 0 | 1 | All |
|---|---|---|---|
| All | 18456 | 9985 | 28441 |
| 2 | 15506 | 8213 | 23719 |
| 3 | 2425 | 1368 | 3793 |
| 4 | 514 | 398 | 912 |
| 5 | 11 | 6 | 17 |
There are slightly more cancellations as the family size increases from 2 to 4 people, but then a family of 5 and a family of 2 have similar rates.
Let's do a similar analysis for customers who stay at least one week night and one weekend night at the hotel.
stay_data = data[(data["no_of_week_nights"] > 0) & (data["no_of_weekend_nights"] > 0)]
stay_data.shape
(17094, 18)
stay_data["total_days"] = (
stay_data["no_of_week_nights"] + stay_data["no_of_weekend_nights"]
)
stacked_barplot(stay_data, "total_days", "booking_status", rotation = 0, sort_columns = False) ## Complete the code to plot stacked barplot for total days and booking status
| total_days | 0 | 1 | All |
|---|---|---|---|
| All | 10979 | 6115 | 17094 |
| 3 | 3689 | 2183 | 5872 |
| 4 | 2977 | 1387 | 4364 |
| 5 | 1593 | 738 | 2331 |
| 2 | 1301 | 639 | 1940 |
| 6 | 566 | 465 | 1031 |
| 7 | 590 | 383 | 973 |
| 8 | 157 | 142 | 299 |
| 10 | 36 | 76 | 112 |
| 9 | 61 | 56 | 117 |
| 11 | 9 | 46 | 55 |
Generally, as the number of days increases there is an increase in cancellations, but the pattern really only becomes consistent once total days exceeds 8. Note that this reflects the outlier work we did earlier, so the 11-total-days bucket is truncated.
Repeating guests are the guests who stay in the hotel often and are important to brand equity. Let's see what percentage of repeating guests cancel?
stacked_barplot(data, "repeated_guest", "booking_status", rotation = 0, sort_columns = False) ## Complete the code to plot stacked barplot for repeated guests and booking status
| repeated_guest | 0 | 1 | All |
|---|---|---|---|
| All | 24390 | 11885 | 36275 |
| 0 | 23476 | 11869 | 35345 |
| 1 | 914 | 16 | 930 |
We have seen this a few ways now; generally, repeated guests are not cancelling.
As hotel room prices are dynamic, Let's see how the prices vary across different months
plt.figure(figsize=(10, 5))
sns.lineplot(data, x= 'arrival_month', y = 'avg_price_per_room') ## Complete the code to create lineplot between average price per room and arrival month
plt.show()
The "summer months" in the northern hemisphere are June to August. This is when most students are not in school, so it makes sense that there is increased travel and thus higher prices in those months.
pd.crosstab(data['market_segment_type'], data['room_type_reserved']).plot(kind='bar', stacked=True)
plt.show()
These boxplots for outlier detection are a little choppy as compared to the analysis above, but included for thoroughness. There are clearly still some outliers, but I am confident in the previous treatments.
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# dropping booking_status
numeric_columns.remove("booking_status")
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
As we said at the start, lets take one last look at duplicates to see if they increased with the outlier treatments:
dups_by_target(data, 'booking_status', 0,1)
| | Status | Duplicate Count | Percentage | Unique Sets Count | Max Count in Sets |
|---|---|---|---|---|---|
| 0 | 0 | 5833 | 23.92% | 2093 | 91 |
| 1 | 1 | 4443 | 37.38% | 1045 | 83 |
We actually do not see much change in this data! That means the outlier treatments did not create new "duplicate reservation types". It might mean the outlier work was unnecessary, but let's carry on.
The F1 score is the metric to be maximized: the greater the F1 score, the better the balance between false negatives and false positives. First, let's create functions to calculate the different metrics and the confusion matrix so that we don't have to repeat the same code for each model.
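As a reminder, F1 is the harmonic mean of precision and recall, so it is only large when both error types are controlled. A quick hand computation on hypothetical confusion-matrix counts:

```python
# hypothetical counts: 700 true positives, 250 false positives, 300 false negatives
tp, fp, fn = 700, 250, 300

precision = tp / (tp + fp)  # 700 / 950
recall = tp / (tp + fn)     # 700 / 1000
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # → 0.737 0.7 0.718
```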
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# checking which probabilities are greater than threshold
pred_temp = model.predict(predictors) > threshold
# rounding off the above values to get classes
pred = np.round(pred_temp)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
# defining a function to plot the confusion_matrix of a classification model
from sklearn.metrics import confusion_matrix
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
import statsmodels.api as sm

def treating_multicollinearity(predictors, target, high_vif_columns):
"""
Checking the effect of dropping the columns showing high multicollinearity
on model performance (adj. R-squared and RMSE)
predictors: independent variables
target: dependent variable
high_vif_columns: columns having high VIF
"""
# empty lists to store adj. R-squared and RMSE values
adj_r2 = []
rmse = []
# build ols models by dropping one of the high VIF columns at a time
# store the adjusted R-squared and RMSE in the lists defined previously
for cols in high_vif_columns:
# defining the new train set
train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]
# create the model
olsmodel = sm.OLS(target, train).fit()
# adding adj. R-squared and RMSE to the lists
adj_r2.append(olsmodel.rsquared_adj)
rmse.append(np.sqrt(olsmodel.mse_resid))
# creating a dataframe for the results
temp = pd.DataFrame(
{
"col": high_vif_columns,
"Adj. R-squared after_dropping col": adj_r2,
"RMSE after dropping col": rmse,
}
).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
temp.reset_index(drop=True, inplace=True)
return temp
# Create Independent and Dependent Variables
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]
# Add a constant to X
X = sm.add_constant(X)
# Create dummy variables for categorical columns in X
X = pd.get_dummies(X, drop_first=True)
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
# This is a quick check to make sure that our class distribution is equal across the train and test sets
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 28)
Shape of test set :  (10883, 28)

Percentage of classes in training set:

| booking_status | proportion |
|---|---|
| 0 | 0.67064 |
| 1 | 0.32936 |

Percentage of classes in test set:

| booking_status | proportion |
|---|---|
| 0 | 0.67638 |
| 1 | 0.32362 |
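The class shares happen to line up well here even though the split above is not stratified; passing `stratify=Y` to `train_test_split` would guarantee it. A small sketch on made-up labels with the same 67/33 imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: 67 non-cancellations (0) and 33 cancellations (1)
y = np.array([0] * 67 + [1] * 33)
X = np.arange(100).reshape(-1, 1)

# stratify=y forces near-identical class shares in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(y_tr.mean().round(2), y_te.mean().round(2))  # → 0.33 0.33
```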
# we will define a function to check VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
vif_df = checking_vif(X_train).sort_values(by = "VIF")
vif_df
| | feature | VIF |
|---|---|---|
| 19 | room_type_reserved_Room_Type 3 | 1.00330 |
| 9 | arrival_date | 1.00669 |
| 16 | type_of_meal_plan_Meal Plan 3 | 1.02524 |
| 21 | room_type_reserved_Room_Type 5 | 1.02825 |
| 5 | required_car_parking_space | 1.04036 |
| 3 | no_of_weekend_nights | 1.05479 |
| 4 | no_of_week_nights | 1.09185 |
| 18 | room_type_reserved_Room_Type 2 | 1.10609 |
| 23 | room_type_reserved_Room_Type 7 | 1.11625 |
| 14 | no_of_special_requests | 1.25137 |
| 15 | type_of_meal_plan_Meal Plan 2 | 1.27242 |
| 17 | type_of_meal_plan_Not Selected | 1.27535 |
| 8 | arrival_month | 1.27656 |
| 1 | no_of_adults | 1.35365 |
| 20 | room_type_reserved_Room_Type 4 | 1.36521 |
| 11 | no_of_previous_cancellations | 1.39560 |
| 6 | lead_time | 1.39953 |
| 7 | arrival_year | 1.43187 |
| 12 | no_of_previous_bookings_not_canceled | 1.65182 |
| 10 | repeated_guest | 1.78366 |
| 22 | room_type_reserved_Room_Type 6 | 2.05651 |
| 13 | avg_price_per_room | 2.07307 |
| 2 | no_of_children | 2.09439 |
| 24 | market_segment_type_Complementary | 4.50122 |
| 25 | market_segment_type_Corporate | 16.91748 |
| 26 | market_segment_type_Offline | 64.08484 |
| 27 | market_segment_type_Online | 71.16080 |
| 0 | const | 39496591.94636 |
The only high VIFs are in the categorical dummies. This is because certain market segments only reserve certain room types. These will drop away when we do performance checks.
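For intuition on the numbers above: a feature's VIF is 1 / (1 - R²) from regressing that feature on all the others, which is why structurally related columns inflate it. A small numpy sketch with made-up data, where one column is nearly a copy of another:

```python
import numpy as np

# hypothetical predictors: x2 is x1 plus a tiny wiggle
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = x1 + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1])

# regress x2 on x1 (with intercept), then VIF = 1 / (1 - R^2)
X = np.column_stack([np.ones_like(x1), x1])
beta, *_ = np.linalg.lstsq(X, x2, rcond=None)
resid = x2 - X @ beta
r2 = 1 - resid.var() / x2.var()
vif = 1 / (1 - r2)
print(round(vif, 1))  # → 309.2, far above the usual warning threshold of 5
```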
# Fitting logistic regression model
logit = sm.Logit(y_train, X_train.astype(float))  # cast to float so statsmodels can handle the boolean dummy columns
lg = logit.fit(disp = False)
# Printing summary of the model
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25364
Method: MLE Df Model: 27
Date: Sat, 30 Sep 2023 Pseudo R-squ.: 0.3291
Time: 09:14:54 Log-Likelihood: -10796.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -922.7827 120.975 -7.628 0.000 -1159.889 -685.676
no_of_adults 0.1119 0.038 2.968 0.003 0.038 0.186
no_of_children 0.1587 0.062 2.555 0.011 0.037 0.280
no_of_weekend_nights 0.1140 0.020 5.803 0.000 0.076 0.153
no_of_week_nights 0.0156 0.014 1.153 0.249 -0.011 0.042
required_car_parking_space -1.5980 0.138 -11.600 0.000 -1.868 -1.328
lead_time 0.0158 0.000 59.114 0.000 0.015 0.016
arrival_year 0.4561 0.060 7.608 0.000 0.339 0.574
arrival_month -0.0419 0.006 -6.467 0.000 -0.055 -0.029
arrival_date 0.0005 0.002 0.280 0.779 -0.003 0.004
repeated_guest -2.3610 0.618 -3.817 0.000 -3.573 -1.149
no_of_previous_cancellations 0.2658 0.086 3.094 0.002 0.097 0.434
no_of_previous_bookings_not_canceled -0.1724 0.153 -1.128 0.259 -0.472 0.127
avg_price_per_room 0.0189 0.001 25.514 0.000 0.017 0.020
no_of_special_requests -1.4709 0.030 -48.892 0.000 -1.530 -1.412
type_of_meal_plan_Meal Plan 2 0.1735 0.067 2.607 0.009 0.043 0.304
type_of_meal_plan_Meal Plan 3 27.2852 1.6e+05 0.000 1.000 -3.13e+05 3.13e+05
type_of_meal_plan_Not Selected 0.2753 0.053 5.183 0.000 0.171 0.379
room_type_reserved_Room_Type 2 -0.3620 0.131 -2.757 0.006 -0.619 -0.105
room_type_reserved_Room_Type 3 -0.0182 1.314 -0.014 0.989 -2.593 2.557
room_type_reserved_Room_Type 4 -0.2783 0.053 -5.226 0.000 -0.383 -0.174
room_type_reserved_Room_Type 5 -0.7202 0.209 -3.439 0.001 -1.131 -0.310
room_type_reserved_Room_Type 6 -0.9472 0.151 -6.262 0.000 -1.244 -0.651
room_type_reserved_Room_Type 7 -1.3507 0.292 -4.627 0.000 -1.923 -0.779
market_segment_type_Complementary -28.1577 1.6e+05 -0.000 1.000 -3.13e+05 3.13e+05
market_segment_type_Corporate -1.2256 0.264 -4.634 0.000 -1.744 -0.707
market_segment_type_Offline -2.2291 0.253 -8.811 0.000 -2.725 -1.733
market_segment_type_Online -0.4277 0.250 -1.713 0.087 -0.917 0.062
========================================================================================================
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train, y_train)
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80541 | 0.63219 | 0.73923 | 0.68153 |
confusion_matrix_statsmodels(lg, X_train, y_train)
# initial list of columns
predictors = X_train.copy()
cols = predictors.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
# defining the train set
X_train_aux = predictors[cols]
# fitting the model
model = sm.OLS(y_train, X_train_aux).fit()
# getting the p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols
print(selected_features)
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Complementary', 'market_segment_type_Corporate', 'market_segment_type_Offline']
^ That is the list of features remaining after that analysis.
X_train1 = X_train[selected_features]
X_test1 = X_test[selected_features]
logit1 = sm.Logit(y_train, X_train1)
lg1 = logit1.fit(disp=False)  # can't get Google Colab to increase iterations, so it stops at 35
print(lg1.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25372
Method: MLE Df Model: 19
Date: Sat, 30 Sep 2023 Pseudo R-squ.: 0.3279
Time: 09:14:56 Log-Likelihood: -10814.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -861.2730 116.634 -7.384 0.000 -1089.871 -632.675
no_of_adults 0.1082 0.037 2.898 0.004 0.035 0.181
no_of_children 0.1569 0.062 2.529 0.011 0.035 0.279
no_of_weekend_nights 0.1174 0.020 6.014 0.000 0.079 0.156
required_car_parking_space -1.6078 0.138 -11.678 0.000 -1.878 -1.338
lead_time 0.0160 0.000 61.092 0.000 0.015 0.016
arrival_year 0.4254 0.058 7.360 0.000 0.312 0.539
arrival_month -0.0444 0.006 -6.931 0.000 -0.057 -0.032
no_of_previous_bookings_not_canceled -0.6621 0.213 -3.115 0.002 -1.079 -0.246
avg_price_per_room 0.0194 0.001 27.254 0.000 0.018 0.021
no_of_special_requests -1.4694 0.030 -49.011 0.000 -1.528 -1.411
type_of_meal_plan_Not Selected 0.2700 0.053 5.109 0.000 0.166 0.374
room_type_reserved_Room_Type 2 -0.3674 0.131 -2.796 0.005 -0.625 -0.110
room_type_reserved_Room_Type 4 -0.2803 0.053 -5.314 0.000 -0.384 -0.177
room_type_reserved_Room_Type 5 -0.7280 0.209 -3.489 0.000 -1.137 -0.319
room_type_reserved_Room_Type 6 -0.9758 0.151 -6.479 0.000 -1.271 -0.681
room_type_reserved_Room_Type 7 -1.3927 0.292 -4.777 0.000 -1.964 -0.821
market_segment_type_Complementary -26.9785 1.19e+05 -0.000 1.000 -2.33e+05 2.33e+05
market_segment_type_Corporate -0.8522 0.103 -8.299 0.000 -1.053 -0.651
market_segment_type_Offline -1.7764 0.051 -35.122 0.000 -1.876 -1.677
========================================================================================================
Model 1 (lg1) now has no multicollinearity and only significant predictors. Let's analyze this again:
# converting coefficients to odds
odds = np.exp(lg1.params)
# finding the percentage change
perc_change_odds = (np.exp(lg1.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
odds_df = pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train1.columns).T
print(odds_df)
| Feature | Odds | Change_odd% |
|---|---|---|
| const | 0.00000 | -100.00000 |
| no_of_adults | 1.11423 | 11.42258 |
| no_of_children | 1.16990 | 16.99034 |
| no_of_weekend_nights | 1.12460 | 12.46016 |
| required_car_parking_space | 0.20033 | -79.96676 |
| lead_time | 1.01610 | 1.61039 |
| arrival_year | 1.53022 | 53.02158 |
| arrival_month | 0.95653 | -4.34691 |
| no_of_previous_bookings_not_canceled | 0.51576 | -48.42417 |
| avg_price_per_room | 1.01961 | 1.96139 |
| no_of_special_requests | 0.23007 | -76.99322 |
| type_of_meal_plan_Not Selected | 1.30999 | 30.99875 |
| room_type_reserved_Room_Type 2 | 0.69256 | -30.74447 |
| room_type_reserved_Room_Type 4 | 0.75555 | -24.44507 |
| room_type_reserved_Room_Type 5 | 0.48290 | -51.71047 |
| room_type_reserved_Room_Type 6 | 0.37690 | -62.30984 |
| room_type_reserved_Room_Type 7 | 0.24839 | -75.16054 |
| market_segment_type_Complementary | 0.00000 | -100.00000 |
| market_segment_type_Corporate | 0.42648 | -57.35221 |
| market_segment_type_Offline | 0.16925 | -83.07512 |
top6_columns = odds_df.abs().sort_values(by="Change_odd%", axis=1, ascending=False).iloc[:, :6]
print(top6_columns)
| Feature | Odds | Change_odd% (abs) |
|---|---|---|
| const | 0.00000 | 100.00000 |
| market_segment_type_Complementary | 0.00000 | 100.00000 |
| market_segment_type_Offline | 0.16925 | 83.07512 |
| required_car_parking_space | 0.20033 | 79.96676 |
| no_of_special_requests | 0.23007 | 76.99322 |
| room_type_reserved_Room_Type 7 | 0.24839 | 75.16054 |
Six was chosen simply because the constant will be one of the columns; I am actually interested in the top five features that change the odds, so I can better compare to the decision tree visualization.
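To make the lead_time odds concrete: exponentiating the coefficient gives a per-day odds multiplier, which compounds over a longer booking horizon:

```python
import numpy as np

coef_lead_time = 0.0160  # lead_time coefficient from the lg1 summary above

per_day = np.exp(coef_lead_time)         # odds multiplier per extra day of lead time
per_month = np.exp(coef_lead_time * 30)  # multiplier for booking 30 days earlier
print(round(float(per_day), 3), round(float(per_month), 3))  # → 1.016 1.616
```

So each extra day of lead time raises the cancellation odds by about 1.6%, which compounds to roughly 62% higher odds over a month.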
# creating confusion matrix
conf_matrix_default = confusion_matrix_statsmodels(lg1, X_train1, y_train)
# checking model performance on train set (seen 70% data)
print("Training Performance - Model 0")
lg_perf_train = model_performance_classification_statsmodels(lg, X_train, y_train)
print(lg_perf_train)
# checking model performance on test set (seen 30% data)
print("Test Performance- Model 0")
lg_perf_test = model_performance_classification_statsmodels(lg, X_test, y_test)
print(lg_perf_test,"\n\n")
# checking model performance on train set (seen 70% data)
print("Training Performance - Model 1")
lg1_perf_train1 = model_performance_classification_statsmodels(lg1, X_train1, y_train)
print(lg1_perf_train1)
# checking model performance on test set (seen 30% data)
print("Test Performance- Model 1")
lg_perf_test1= model_performance_classification_statsmodels(lg1, X_test1, y_test)
print(lg_perf_test1)
Training Performance - Model 0

| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80541 | 0.63219 | 0.73923 | 0.68153 |

Test Performance - Model 0

| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80428 | 0.63061 | 0.72820 | 0.67590 |

Training Performance - Model 1

| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80415 | 0.62920 | 0.73759 | 0.67910 |

Test Performance - Model 1

| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80401 | 0.62890 | 0.72838 | 0.67500 |
The improvement is in the consistency of performance across train and test. In other words, we are no longer overfitting to the train data. I hope we can continue to improve the F1 score, though, since 0.68 is barely above the 67% share of non-cancelled bookings in the original data set.
from sklearn.metrics import roc_auc_score, roc_curve

logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.3832236744031388
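The `argmax(tpr - fpr)` recipe above is Youden's J statistic; the same recipe on a toy set of hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

# hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.35, 0.6, 0.4, 0.55, 0.7, 0.9])

# Youden's J: the threshold with the largest gap between tpr and fpr
fpr, tpr, thresholds = roc_curve(y_true, y_score)
best = thresholds[np.argmax(tpr - fpr)]
print(best)  # → 0.4
```

At 0.4 every positive is caught while only one negative slips through, the best trade-off available on this toy curve.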
# creating confusion matrix
conf_matrix_roc = confusion_matrix_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance AUC:")
log_reg_model_train_perf_threshold_auc_roc
Training performance AUC:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.79620 | 0.72701 | 0.67766 | 0.70147 |
This shifted more of the error into false positives, which in this case means we are over-predicting cancellations. Accuracy and precision decreased, but recall and F1 increased.
from sklearn.metrics import precision_recall_curve

y_scores = lg1.predict(X_train1)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="precision")
plt.plot(thresholds, recalls[:-1], "g--", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# I had to google this; I wanted to find the exact intersection.
def find_threshold_for_intersect(precisions, recalls, thresholds):
    # precision rises and recall falls as the threshold grows, so the
    # intersection is the first point where precision catches up with recall
    # (>= instead of == guards against the curves never being exactly equal)
    for i in range(len(precisions) - 1):
        if precisions[i] >= recalls[i]:
            return thresholds[i]
# Find the threshold for intersection
intersect_threshold = find_threshold_for_intersect(prec, rec, tre)
print("Intersection Threshold:", intersect_threshold)
Intersection Threshold: 0.4222295680330016
Truly I need to minimize both, but operationally not having enough rooms for customers is a larger issue than rooms sitting empty (not that airlines agree with that assessment). The intersection falls at 0.42, so let's compare that to the AUC-ROC threshold of 0.38.
print("PR Threshold: 0.42 ")
conf_matrix_pr = confusion_matrix_statsmodels(lg1, X_train1, y_train, threshold=intersect_threshold)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=intersect_threshold
)
log_reg_model_train_perf_threshold_curve
PR Threshold: 0.42
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80041 | 0.69688 | 0.69705 | 0.69696 |
print("ROC-AUC Threshold: 0.38")
conf_matrix_roc = confusion_matrix_statsmodels(lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
log_reg_model_train_perf_threshold_curve
ROC-AUC Threshold: 0.38
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.79620 | 0.72701 | 0.67766 | 0.70147 |
print("Default Threshold: 0.50")
conf_matrix_default = confusion_matrix_statsmodels(lg1, X_train1, y_train)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_train1, y_train
)
log_reg_model_train_perf_threshold_curve
Default Threshold: 0.50
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.80415 | 0.62920 | 0.73759 | 0.67910 |
The threshold that maximizes F1, which we originally said was the goal, is the ROC-AUC threshold of 0.38.
However, this is a bit of a tough call, and I would certainly want to talk to an SME about the balance of false negatives and false positives. The 0.38 threshold maximizes F1, but it also minimizes precision: the false positive rate is 11.39% at this cutoff. In striking the balance between F1 and not overbooking the hotel, I could see the default threshold being selected instead. For the purposes of this assignment, I will select the threshold of 0.38 but also advise taking the risk of overbooking into account.
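To make the overbooking risk concrete, the false positive rate can be read off a confusion matrix laid out sklearn-style as [[TN, FP], [FN, TP]]. The counts below are illustrative, not the model's actual matrix:

```python
import numpy as np

def false_positive_rate(cm):
    # FPR = FP / (FP + TN): the share of guests who keep their booking
    # that the model would wrongly flag as cancellations (overbooking risk).
    tn, fp = cm[0]
    return fp / (fp + tn)

cm_demo = np.array([[15000, 2000],   # actual kept bookings: TN, FP
                    [ 2300, 6100]])  # actual cancellations: FN, TP
print(round(false_positive_rate(cm_demo), 4))  # -> 0.1176
```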
# setting the threshold
optimal_threshold_curve = optimal_threshold_auc_roc
log_reg_model_test_perf = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold = optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.79721 | 0.72743 | 0.67262 | 0.69895 |
print("Test performance:")
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc)
Test performance:
# rebuild the data, since this is a different model
# Create Independent and Dependent Variables
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]
# Add a constant to X
X = sm.add_constant(X)
# Create dummy variables for categorical columns in X
X = pd.get_dummies(X, drop_first=True)
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
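A small illustration of the drop_first=True encoding used above: each categorical column with k levels becomes k-1 indicator columns, dropping the alphabetically first level to avoid perfect collinearity with the constant. The toy frame below is illustrative, not the project data:

```python
import pandas as pd

# Toy frame standing in for a categorical booking column.
demo = pd.DataFrame({"meal_plan": ["Plan 1", "Plan 2", "Not Selected", "Plan 1"]})
encoded = pd.get_dummies(demo, drop_first=True)
# The alphabetically first level ("Not Selected") is dropped.
print(list(encoded.columns))  # -> ['meal_plan_Plan 1', 'meal_plan_Plan 2']
```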
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
tree1_confmat_train = confusion_matrix_statsmodels(model, X_train, y_train)
tree1_perf_train = model_performance_classification_statsmodels(
model, X_train, y_train
)
tree1_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99421 | 0.98661 | 0.99578 | 0.99117 |
tree1_confmat_test = confusion_matrix_statsmodels(model, X_test, y_test)
tree1_perf_test = model_performance_classification_statsmodels(
    model, X_test, y_test
)
tree1_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99421 | 0.98661 | 0.99578 | 0.99117 |
This is a VERY high F1. Not going to visualize the full tree yet - though I'll admit I tried. After a runtime of 3 minutes I interrupted it and moved on.
We can look at the important features, however:
feature_names = list(X_train.columns)
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Many of these are not important, so time to reduce overfitting and prune the tree.
Pre-Pruning
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7, 1),
"max_leaf_nodes": [2, 3, 5, 10, 50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(f1_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
tree2_confmat_train = confusion_matrix_statsmodels(estimator, X_train, y_train)
tree2_perf_train = model_performance_classification_statsmodels(
    estimator, X_train, y_train
)
tree2_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.83101 | 0.78620 | 0.72428 | 0.75397 |
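The estimator above was fit with class_weight="balanced". A sketch of how sklearn derives those weights, n_samples / (n_classes * count_c), on an illustrative 670/330 class split (not the dataset's exact ratio):

```python
import numpy as np

# class_weight="balanced" gives class c the weight n_samples / (n_classes * count_c),
# which up-weights the minority (canceled) class. The 670/330 split is illustrative.
y_demo = np.array([0] * 670 + [1] * 330)
classes, counts = np.unique(y_demo, return_counts=True)
weights = len(y_demo) / (len(classes) * counts)
print({int(c): round(float(w), 3) for c, w in zip(classes, weights)})  # -> {0: 0.746, 1: 1.515}
```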
The new F1 of 0.75 is higher than the regression model's, and drastically lower than the unpruned tree's.
tree2_confmat_test = confusion_matrix_statsmodels(estimator, X_test, y_test)
tree2_perf_test = model_performance_classification_statsmodels(
    estimator, X_test, y_test
)
tree2_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.83497 | 0.78336 | 0.72758 | 0.75444 |
Good performance on test as well, and not a large change from train.
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
This is still a little unwieldy.
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50: splits on no_of_weekend_nights, avg_price_per_room, lead_time
|   |   |   |--- lead_time >  90.50: splits on lead_time, avg_price_per_room, no_of_week_nights
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- lead_time <= 13.50: splits on avg_price_per_room, arrival_month, lead_time
|   |   |   |--- lead_time >  13.50: splits on required_car_parking_space, avg_price_per_room
|   |--- no_of_special_requests > 0.50: mostly class 0; splits on no_of_special_requests,
|       market_segment_type_Online, lead_time, required_car_parking_space
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.04: splits on no_of_special_requests, no_of_adults,
|       no_of_weekend_nights, market_segment_type
|   |--- avg_price_per_room > 100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50: weights [0.00, 3200.19] -> class 1
|   |   |   |--- no_of_special_requests >  2.50: weights [23.11, 0.00]  -> class 0
|   |   |--- arrival_month > 11.50: splits on no_of_special_requests, arrival_date
(full rule listing was flattened during export; condensed to the top splits above)
# importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.00000 | 0.00838 |
| 1 | 0.00000 | 0.00838 |
| 2 | 0.00000 | 0.00838 |
| 3 | 0.00000 | 0.00838 |
| 4 | 0.00000 | 0.00838 |
| ... | ... | ... |
| 1843 | 0.00890 | 0.32806 |
| 1844 | 0.00980 | 0.33786 |
| 1845 | 0.01272 | 0.35058 |
| 1846 | 0.03412 | 0.41882 |
| 1847 | 0.08118 | 0.50000 |
1848 rows × 2 columns
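The path above shows total leaf impurity rising with alpha; the flip side, sketched here on a small synthetic dataset (not the booking data), is that the node count is non-increasing as ccp_alpha grows:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: a noisy threshold rule, enough to grow a moderately deep tree.
rng = np.random.RandomState(1)
X_toy = rng.rand(200, 3)
y_toy = (X_toy[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

node_counts = []
for alpha in [0.0, 0.005, 0.02, 0.1]:
    t = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X_toy, y_toy)
    node_counts.append(t.tree_.node_count)
# Larger alpha prunes more aggressively, so the counts never increase.
print(node_counts)
```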
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.08117914389136943
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    f1_train.append(f1_score(y_train, pred_train))
f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    f1_test.append(f1_score(y_test, pred_test))
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00012291224171537176,
class_weight='balanced', random_state=1)
tree3_confmat_train = confusion_matrix_statsmodels(best_model, X_train, y_train)
tree3_perf_train = model_performance_classification_statsmodels(
    best_model, X_train, y_train
)
tree3_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.89946 | 0.90231 | 0.81297 | 0.85531 |
tree3_confmat_test = confusion_matrix_statsmodels(best_model, X_test, y_test)
tree3_perf_test = model_performance_classification_statsmodels(
    best_model, X_test, y_test
)
tree3_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.86925 | 0.85548 | 0.76725 | 0.80897 |
Back to slightly overfitting the training data; the F1 score drops from 0.85 on train to 0.81 on test.
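One way to keep the "slightly overfitting" judgment honest is to track the train-test gap explicitly; the F1 values below are the ones reported in the tables above:

```python
# F1 scores reported above for the cost-complexity-pruned tree.
f1_train_best = 0.85531
f1_test_best = 0.80897
# A gap near zero suggests good generalization; this one is small but nonzero.
gap = f1_train_best - f1_test_best
print(round(gap, 3))  # -> 0.046
```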
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
I don't have a good grasp on how complex is too complex. The text report is certainly easier for me to parse.
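If the full report is still hard to parse, export_text accepts a max_depth argument, so only the top of the tree needs to be printed. The toy tree below just illustrates the call; it is not the booking model:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy tree: one informative feature so the top split is easy to read.
rng = np.random.RandomState(1)
X_toy = rng.rand(100, 2)
y_toy = (X_toy[:, 0] > 0.5).astype(int)
toy_tree = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)
# max_depth limits how many levels of rules are printed; deeper
# branches are summarized as "truncated branch of depth N".
report = export_text(toy_tree, feature_names=["f0", "f1"], max_depth=2)
print(report)
```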
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50: deep splits on lead_time, no_of_weekend_nights,
|   |       avg_price_per_room, arrival_month, arrival_date, market_segment_type_Offline
|   |   |--- market_segment_type_Online > 0.50: deep splits on lead_time, avg_price_per_room,
|   |       required_car_parking_space, arrival_year, arrival_month
|   |--- no_of_special_requests > 0.50: mostly class 0; splits on market_segment_type_Online,
|       lead_time, type_of_meal_plan_Not Selected, no_of_weekend_nights
|--- lead_time > 151.50: (branch cut off in the exported listing)
(full rule listing was flattened and truncated during export; condensed to the top splits above)
| | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | | |--- weights: [4.47, 57.69] class: 1 | | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | | |--- lead_time <= 66.50 | | | | | | | | | | | |--- weights: [5.22, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 66.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | |--- avg_price_per_room <= 71.93 | | | | | | | | | | | |--- weights: [54.43, 3.04] class: 0 | | | | | | | | | | |--- avg_price_per_room > 71.93 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | |--- avg_price_per_room > 118.55 | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | |--- arrival_date <= 19.50 | | | | | | | | | |--- avg_price_per_room <= 177.15 | | | | | | | | | | |--- avg_price_per_room <= 118.98 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- avg_price_per_room > 118.98 | | | | | | | | | | | |--- truncated branch of depth 7 | | | | | | | | | |--- avg_price_per_room > 177.15 | | | | | | | | | | |--- arrival_date <= 7.00 | | | | | | | | | | | |--- weights: [6.71, 0.00] class: 0 | | | | | | | | | | |--- arrival_date > 7.00 | | | | | | | | | | | |--- weights: [12.67, 24.29] class: 1 | | | | | | | | |--- arrival_date > 19.50 | | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | | |--- avg_price_per_room <= 121.20 | | | | | | | | | | | |--- weights: [18.64, 6.07] class: 0 | | | | | | | | | | |--- avg_price_per_room > 121.20 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | | |--- lead_time <= 
55.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 55.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- arrival_month > 8.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | |--- weights: [11.93, 10.63] class: 0 | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | |--- weights: [37.28, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- avg_price_per_room <= 119.20 | | | | | | | | | | | |--- weights: [9.69, 28.84] class: 1 | | | | | | | | | | |--- avg_price_per_room > 119.20 | | | | | | | | | | | |--- truncated branch of depth 12 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- lead_time <= 100.00 | | | | | | | | | | | |--- weights: [49.95, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 100.00 | | | | | | | | | | | |--- weights: [0.75, 18.22] class: 1 | | | | | |--- required_car_parking_space > 0.50 | | | | | | |--- weights: [134.20, 1.52] class: 0 | | |--- no_of_special_requests > 1.50 | | | |--- lead_time <= 90.50 | | | | |--- no_of_week_nights <= 3.50 | | | | | |--- weights: [1585.04, 0.00] class: 0 | | | | |--- no_of_week_nights > 3.50 | | | | | |--- no_of_special_requests <= 2.50 | | | | | | |--- lead_time <= 6.50 | | | | | | | |--- weights: [32.06, 1.52] class: 0 | | | | | | |--- lead_time > 6.50 | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | |--- weights: [103.63, 50.10] class: 0 | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | |--- weights: [44.73, 6.07] class: 0 | | | | | |--- no_of_special_requests > 2.50 | | | | | | |--- weights: [52.19, 0.00] class: 0 | | | |--- lead_time > 90.50 | | | | |--- no_of_special_requests <= 2.50 | | | | | |--- arrival_month <= 8.50 | | | | | | |--- avg_price_per_room <= 202.95 | | | | | | | |--- arrival_year <= 2017.50 | | | | | | 
| | |--- arrival_month <= 7.50 | | | | | | | | | |--- weights: [1.49, 9.11] class: 1 | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | |--- weights: [8.20, 3.04] class: 0 | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | |--- lead_time <= 150.50 | | | | | | | | | |--- weights: [175.20, 28.84] class: 0 | | | | | | | | |--- lead_time > 150.50 | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | |--- avg_price_per_room > 202.95 | | | | | | | |--- weights: [0.00, 10.63] class: 1 | | | | | |--- arrival_month > 8.50 | | | | | | |--- avg_price_per_room <= 153.15 | | | | | | | |--- room_type_reserved_Room_Type 2 <= 0.50 | | | | | | | | |--- avg_price_per_room <= 71.12 | | | | | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 71.12 | | | | | | | | | |--- avg_price_per_room <= 90.42 | | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | | |--- weights: [12.67, 7.59] class: 0 | | | | | | | | | |--- avg_price_per_room > 90.42 | | | | | | | | | | |--- weights: [64.12, 60.72] class: 0 | | | | | | | |--- room_type_reserved_Room_Type 2 > 0.50 | | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 153.15 | | | | | | | |--- weights: [12.67, 3.04] class: 0 | | | | |--- no_of_special_requests > 2.50 | | | | | |--- weights: [67.10, 0.00] class: 0 |--- lead_time > 151.50 | |--- avg_price_per_room <= 100.04 | | |--- no_of_special_requests <= 0.50 | | | |--- no_of_adults <= 1.50 | | | | |--- market_segment_type_Online <= 0.50 | | | | | |--- lead_time <= 163.50 | | | | | | |--- arrival_month <= 5.00 | | | | | | | |--- weights: [2.98, 0.00] class: 0 | | | | | | |--- arrival_month > 5.00 | | | | | | | |--- weights: [0.75, 24.29] class: 1 | | | | | |--- lead_time > 163.50 | | | | | | |--- lead_time <= 341.00 | | | | | | | |--- lead_time <= 173.00 | | | | | | | | |--- 
arrival_date <= 3.50 | | | | | | | | | |--- weights: [46.97, 9.11] class: 0 | | | | | | | | |--- arrival_date > 3.50 | | | | | | | | | |--- no_of_weekend_nights <= 1.00 | | | | | | | | | | |--- weights: [0.00, 13.66] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 1.00 | | | | | | | | | | |--- weights: [2.24, 0.00] class: 0 | | | | | | | |--- lead_time > 173.00 | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | |--- arrival_date <= 7.50 | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | | | |--- arrival_date > 7.50 | | | | | | | | | | |--- weights: [6.71, 0.00] class: 0 | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | |--- weights: [188.62, 7.59] class: 0 | | | | | | |--- lead_time > 341.00 | | | | | | | |--- weights: [13.42, 27.33] class: 1 | | | | |--- market_segment_type_Online > 0.50 | | | | | |--- avg_price_per_room <= 2.50 | | | | | | |--- lead_time <= 285.50 | | | | | | | |--- weights: [8.20, 0.00] class: 0 | | | | | | |--- lead_time > 285.50 | | | | | | | |--- weights: [0.75, 3.04] class: 1 | | | | | |--- avg_price_per_room > 2.50 | | | | | | |--- weights: [0.75, 97.16] class: 1 | | | |--- no_of_adults > 1.50 | | | | |--- avg_price_per_room <= 82.47 | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | |--- weights: [2.98, 282.37] class: 1 | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | |--- arrival_month <= 11.50 | | | | | | | |--- lead_time <= 244.00 | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | |--- lead_time <= 166.50 | | | | | | | | | | | |--- weights: [2.24, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 166.50 | | | | | | | | | | | |--- weights: [2.24, 57.69] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | |--- weights: [17.89, 0.00] class: 0 | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | |--- 
arrival_month <= 9.50 | | | | | | | | | | | |--- weights: [11.18, 3.04] class: 0 | | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | | |--- weights: [0.00, 12.14] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | |--- weights: [75.30, 12.14] class: 0 | | | | | | | |--- lead_time > 244.00 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- weights: [25.35, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- avg_price_per_room <= 80.38 | | | | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | | | | |--- weights: [11.18, 264.15] class: 1 | | | | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- avg_price_per_room > 80.38 | | | | | | | | | | |--- weights: [7.46, 0.00] class: 0 | | | | | | |--- arrival_month > 11.50 | | | | | | | |--- weights: [46.22, 0.00] class: 0 | | | | |--- avg_price_per_room > 82.47 | | | | | |--- no_of_adults <= 2.50 | | | | | | |--- lead_time <= 324.50 | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | | |--- weights: [7.46, 986.78] class: 1 | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | | |--- weights: [0.00, 10.63] class: 1 | | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | | |--- weights: [4.47, 0.00] class: 0 | | | | | | | |--- arrival_month > 11.50 | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | |--- weights: [0.00, 19.74] class: 1 | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | |--- weights: [5.22, 0.00] class: 0 | | | | | | |--- lead_time > 324.50 | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | |--- weights: [0.75, 13.66] class: 1 | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | |--- weights: [5.96, 0.00] class: 0 
| | | | | |--- no_of_adults > 2.50 | | | | | | |--- weights: [5.22, 0.00] class: 0 | | |--- no_of_special_requests > 0.50 | | | |--- no_of_weekend_nights <= 0.50 | | | | |--- lead_time <= 180.50 | | | | | |--- lead_time <= 159.50 | | | | | | |--- arrival_month <= 8.50 | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- weights: [1.49, 7.59] class: 1 | | | | | |--- lead_time > 159.50 | | | | | | |--- arrival_date <= 1.50 | | | | | | | |--- weights: [1.49, 3.04] class: 1 | | | | | | |--- arrival_date > 1.50 | | | | | | | |--- weights: [35.79, 1.52] class: 0 | | | | |--- lead_time > 180.50 | | | | | |--- no_of_special_requests <= 2.50 | | | | | | |--- market_segment_type_Online <= 0.50 | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | |--- weights: [12.67, 3.04] class: 0 | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | |--- weights: [0.00, 3.04] class: 1 | | | | | | |--- market_segment_type_Online > 0.50 | | | | | | | |--- weights: [7.46, 206.46] class: 1 | | | | | |--- no_of_special_requests > 2.50 | | | | | | |--- weights: [8.95, 0.00] class: 0 | | | |--- no_of_weekend_nights > 0.50 | | | | |--- market_segment_type_Offline <= 0.50 | | | | | |--- arrival_month <= 11.50 | | | | | | |--- avg_price_per_room <= 76.48 | | | | | | | |--- weights: [46.97, 4.55] class: 0 | | | | | | |--- avg_price_per_room > 76.48 | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | |--- no_of_week_nights <= 5.50 | | | | | | | | | |--- lead_time <= 233.00 | | | | | | | | | | |--- lead_time <= 152.50 | | | | | | | | | | | |--- weights: [1.49, 4.55] class: 1 | | | | | | | | | | |--- lead_time > 152.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- lead_time > 233.00 | | | | | | | | | | |--- weights: [23.11, 19.74] class: 0 | | | | | | | | |--- no_of_week_nights > 5.50 | | | | | | | | | |--- weights: [8.95, 16.70] class: 1 | | | | | | | |--- arrival_date > 27.50 | | | | | | | | |--- 
no_of_week_nights <= 1.50 | | | | | | | | | |--- weights: [2.24, 15.18] class: 1 | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | |--- lead_time <= 269.00 | | | | | | | | | | |--- lead_time <= 176.00 | | | | | | | | | | | |--- weights: [2.24, 7.59] class: 1 | | | | | | | | | | |--- lead_time > 176.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- lead_time > 269.00 | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | |--- arrival_month > 11.50 | | | | | | |--- arrival_date <= 14.50 | | | | | | | |--- weights: [8.20, 3.04] class: 0 | | | | | | |--- arrival_date > 14.50 | | | | | | | |--- weights: [11.18, 31.88] class: 1 | | | | |--- market_segment_type_Offline > 0.50 | | | | | |--- lead_time <= 348.50 | | | | | | |--- weights: [106.61, 3.04] class: 0 | | | | | |--- lead_time > 348.50 | | | | | | |--- weights: [5.96, 4.55] class: 0 | |--- avg_price_per_room > 100.04 | | |--- arrival_month <= 11.50 | | | |--- no_of_special_requests <= 2.50 | | | | |--- weights: [0.00, 3200.19] class: 1 | | | |--- no_of_special_requests > 2.50 | | | | |--- weights: [23.11, 0.00] class: 0 | | |--- arrival_month > 11.50 | | | |--- no_of_special_requests <= 0.50 | | | | |--- weights: [35.04, 0.00] class: 0 | | | |--- no_of_special_requests > 0.50 | | | | |--- arrival_date <= 24.50 | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | |--- arrival_date > 24.50 | | | | | |--- weights: [3.73, 22.77] class: 1
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
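As a numeric complement to the plot, the same importances can be listed as a sorted `pandas` Series. A minimal sketch on synthetic stand-ins (the `feature_names` and the fitted `best_model` here are illustrative, not the notebook's actual objects):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for illustration only
rng = np.random.default_rng(0)
feature_names = ["lead_time", "avg_price_per_room", "no_of_special_requests"]
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)  # target driven entirely by the first feature

best_model = DecisionTreeClassifier(random_state=1).fit(X, y)

# Importances as a Series, largest first
top_features = (
    pd.Series(best_model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(top_features)
```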
# training performance comparison
models_train_comp_df = pd.concat(
[
tree1_perf_train.T,
tree2_perf_train.T,
tree3_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree (Initial)",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree (Initial) | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.99421 | 0.83101 | 0.89946 |
| Recall | 0.98661 | 0.78620 | 0.90231 |
| Precision | 0.99578 | 0.72428 | 0.81297 |
| F1 | 0.99117 | 0.75397 | 0.85531 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
tree1_perf_test.T,
tree2_perf_test.T,
tree3_perf_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree (Initial)",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Decision Tree (Initial) | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.99421 | 0.83497 | 0.86925 |
| Recall | 0.98661 | 0.78336 | 0.85548 |
| Precision | 0.99578 | 0.72758 | 0.76725 |
| F1 | 0.99117 | 0.75444 | 0.80897 |
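The `tree*_perf_*` frames concatenated above presumably come from a helper along these lines — a hypothetical sketch (the function name and shape are assumptions, not the notebook's actual helper) that returns one row of metrics per model, which `.T` then turns into the metric-indexed columns shown:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

def model_performance_classification(model, X, y):
    """One-row frame of Accuracy/Recall/Precision/F1 for a fitted classifier."""
    pred = model.predict(X)
    return pd.DataFrame(
        {
            "Accuracy": [accuracy_score(y, pred)],
            "Recall": [recall_score(y, pred)],
            "Precision": [precision_score(y, pred)],
            "F1": [f1_score(y, pred)],
        }
    )

# Demo on synthetic data: a fully grown tree fits its training set perfectly
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
model = DecisionTreeClassifier(random_state=1).fit(X, y)
perf = model_performance_classification(model, X, y)
print(perf)
```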
The post-pruned tree outperforms the pre-pruned tree on all four metrics (accuracy, recall, precision, and F1) on both the training and test sets.
All of these models are harder to interpret than those explored in class, although with practice the text exports become readable. Here is a depth-3 view of our strongest tree, the post-pruned tree:
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
max_depth=3,  # limit the display to depth 3 for readability
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
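The full text dump is hard to scan; scikit-learn's `export_text` can print the same kind of depth-limited view as readable rules. A minimal sketch on a synthetic stand-in model (the real call would pass `best_model` and `feature_names` from above; the feature names here are placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in model for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X, y)

# max_depth limits the printed depth, mirroring max_depth in plot_tree
rules = export_text(
    model,
    feature_names=["lead_time", "avg_price_per_room"],
    max_depth=3,
)
print(rules)
```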
This top-level view allows us to make a few observations and business decisions:
Business Suggestions
Data Improvements
One category of data seems to be missing: how many cancelled stays end up being rebooked. I appreciate that the target variable needed to stay binary, but previous bookings could have carried three levels instead of two: not cancelled, cancelled, and rebooked.
If any of the hotels host major events, or see spikes in booking volume due to events in the surrounding city, this could be captured as an additional categorical variable: was the booking associated with a special event?
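Both suggested data improvements are straightforward to engineer once the raw fields exist. A minimal sketch on a hypothetical bookings frame (the `rebooked` column and `event_months` calendar are assumptions for illustration, not part of the provided data):

```python
import pandas as pd

# Hypothetical bookings frame: column names and values are assumptions
bookings = pd.DataFrame(
    {
        "booking_status": ["Not_Canceled", "Canceled", "Canceled"],
        "rebooked": [False, False, True],
        "arrival_month": [6, 12, 8],
    }
)

# Three-level outcome instead of a binary target
bookings["outcome"] = "not_cancelled"
bookings.loc[bookings["booking_status"] == "Canceled", "outcome"] = "cancelled"
bookings.loc[bookings["rebooked"], "outcome"] = "rebooked"

# Hypothetical calendar of months with major local events
event_months = {8, 12}
bookings["near_event"] = bookings["arrival_month"].isin(event_months)
print(bookings[["outcome", "near_event"]])
```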
Logistic Regression Model Commentary
Decision Tree Model Commentary